Senior Site Reliability Engineer

Company
Cebu Pacific
Job Location
Philippines, Asia Pacific
Job Role
Engineer
Contract Type
Full-Time
Salary
Posted Date
2026-03-13
Job Expiry Date
2026-04-12
Qualification
Skills Certificate

Key Responsibilities


24/7 Incident Command & Alerting


  • 24/7 Availability: Participate in a shift rotation or on-call schedule to ensure continuous coverage. You are the "eyes on glass" for the organization.
  • Unified Alerting: Manage the notification workflow. Ensure that Critical Alerts for both Infrastructure failures and Application failures trigger immediate notifications to the 24/7 team.
  • Major Incident Management (MIM): Lead the technical response during critical outages. Coordinate cross-functional teams to restore service rapidly.


Observability Strategy (Dynatrace Focus)


  • Dynatrace Administration: Act as the Subject Matter Expert (SME) for our Dynatrace implementation.
      • Configure Management Zones, Alerting Profiles, and Dashboards to provide a "Single Pane of Glass."
      • Utilize Dynatrace PurePath for distributed tracing to identify bottlenecks in microservices.
      • Leverage Davis AI to automatically detect anomalies and reduce alert noise.
  • Comprehensive Monitoring Scope:
      • Network Health: Monitor VPN Tunnel status, Load Balancer (ALB/NLB) health, and DNS latency. Trigger: Alert on packet loss or high latency.
      • Infrastructure Health: Monitor Disk/Volume usage, CPU/Memory saturation, and SSL Certificate expiry.
      • Security: Monitor for DDoS attack patterns and WAF spikes.


Resilience & Chaos Engineering


  • Chaos Engineering: Plan and execute Chaos Engineering exercises (e.g., simulating pod failures, network latency, zone outages) to test the system's resilience and verify that failover mechanisms work as expected.
  • Reliability Recommendations: Proactively analyze trends and provide architectural recommendations to development and infrastructure teams to improve system stability.
  • First Line Troubleshooting: Serve as the L1/L2 troubleshooter for Kubernetes (EKS), AWS, and Linux issues. Execute "Quick Fix" runbooks to mitigate impact before escalating to platform engineering.


Application Triage & Analysis


  • Deep-Dive Triage: Go beyond a basic "system check" and perform deep analysis using Dynatrace. Analyze stack traces and exception logs to pinpoint the exact line of code causing the failure.
  • Root Cause Differentiation: Rapidly differentiate between an Infrastructure Issue (e.g., a network timeout) and an Application Logic Error (e.g., a NullPointerException caused by bad data).
  • Blameless RCA: Facilitate Root Cause Analysis sessions to ensure permanent fixes are applied to recurring problems.


Governance & Reporting (Stability Cadence)


  • Stability Calls: Facilitate and lead the Weekly/Bi-Weekly Stability Call. Present the health status of all technical towers to leadership and stakeholders.
  • Reporting: Generate regular reports on system uptime, error budgets, incident trends, and MTTR (Mean Time To Recovery).
  • Cross-Tower Visibility: Ensure that the dashboards and reports provide value to all teams (Network, App, Cloud), ensuring no siloed "blind spots" in production.
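
For illustration only, the reporting metrics named above (MTTR and error budgets) can be computed along these lines. This is a minimal sketch, not the team's actual tooling: the `Incident` record, the function names, and the 99.9% SLO in the usage note are all hypothetical assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Incident:
    """Hypothetical incident record: when impact started and when service was restored."""
    started: datetime
    resolved: datetime


def mean_time_to_recovery(incidents: List[Incident]) -> timedelta:
    """Average time from start of impact to restoration across all incidents."""
    if not incidents:
        return timedelta(0)
    total = sum((i.resolved - i.started for i in incidents), timedelta(0))
    return total / len(incidents)


def error_budget_remaining(slo_target: float, window: timedelta, downtime: timedelta) -> float:
    """Fraction of the error budget still unspent (negative when overspent).

    slo_target is an availability ratio such as 0.999 (an assumed example value).
    """
    budget_seconds = window.total_seconds() * (1.0 - slo_target)
    if budget_seconds <= 0:
        raise ValueError("slo_target must be below 1.0 and window must be positive")
    return 1.0 - downtime.total_seconds() / budget_seconds
```

Under these assumptions, a 99.9% availability target over a 30-day window allows roughly 43 minutes of downtime, so `error_budget_remaining(0.999, timedelta(days=30), downtime)` shows how much of that allowance is left for the period being reported.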


Automation & Toil Reduction


  • Remediation Scripting: Develop scripts (Python/Bash) to "Auto-Heal" common issues (e.g., clearing logs when disk is full, restarting stuck services).
  • Process Improvement: Identify manual checks and convert them into automated Dynatrace alerts or synthetic tests.
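
As an example of the kind of "Auto-Heal" script this role would write, the sketch below removes old log files once a disk crosses a usage threshold. It is a minimal illustration under stated assumptions: the 90% threshold, the 7-day retention period, the `*.log` pattern, and all function names are hypothetical, and it defaults to a dry run so nothing is deleted unless explicitly requested.

```python
import shutil
import time
from pathlib import Path
from typing import List, Optional


def disk_usage_percent(path: str) -> float:
    """Percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def select_old_logs(log_dir: str, max_age_days: float, now: Optional[float] = None) -> List[Path]:
    """Return *.log files in log_dir last modified more than max_age_days ago."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    return sorted(
        p for p in Path(log_dir).glob("*.log")
        if p.is_file() and p.stat().st_mtime < cutoff
    )


def auto_heal(log_dir: str, threshold_percent: float = 90.0,
              max_age_days: float = 7, dry_run: bool = True) -> List[Path]:
    """Delete old logs only when disk usage is at or above the threshold.

    Returns the files selected for deletion. With dry_run=True (the default)
    the files are reported but left in place.
    """
    if disk_usage_percent(log_dir) < threshold_percent:
        return []
    candidates = select_old_logs(log_dir, max_age_days)
    if not dry_run:
        for path in candidates:
            path.unlink()
    return candidates
```

A script like this would typically run first with `dry_run=True` so the list of candidate files can be reviewed, and only then be wired to an alert with `dry_run=False`.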


Required Qualifications


  • Shift Availability: Must be willing to work in a 24/7 shift environment or a strictly defined on-call rotation.
  • Dynatrace Expertise: Deep experience administering and using Dynatrace in a production environment (Dashboards, OneAgent, PurePaths).
  • Troubleshooting Expertise:
      • Network: Understanding of DNS, TCP/IP, Load Balancing, and Firewalls.
      • Compute/Storage: Understanding of block vs. object storage, CPU steal time, and memory management.
  • Governance: Experience facilitating technical management calls and producing executive-level reliability reports.
  • Application Debugging: Ability to read application logs (Java, Node, Python) to understand why a service failed.
  • Cloud (AWS) & K8s: Solid understanding of EKS, EC2, and other AWS services.

